
Python Data Cleaning for Web Scraping: JSON, MongoDB, and Regex Techniques

Master Python data cleaning for web scraping: JSON parsing, MongoDB storage, JSONP handling with demjson, and regex extraction. Learn to process Star Wars API data, deduplicate MongoDB entries, and extract Quora follower counts efficiently. Boost your scraping skills with these pro methods.

2025-09-17

This article continues from 10 Essential Python Data Cleaning Techniques for Web Scraping, focusing on practical data cleaning methods that modern scraping projects rely on daily.

Most real-world scraping tasks involve APIs, JavaScript-rendered data, and unstructured text. Therefore, mastering Python data cleaning for web scraping is essential for building stable and scalable crawlers.


3. JSON Data Cleaning

Today, most websites expose data through APIs, and JSON has become the dominant response format. As a result, Python developers must handle JSON efficiently.

Example: Star Wars API (SWAPI)

API endpoint:

https://swapi.dev/api/people/

import requests

url = "https://swapi.dev/api/people/"
# verify=False skips TLS certificate checks; use it only when the
# endpoint's certificate is known to be broken, never in production.
response = requests.get(url, verify=False)
json_data = response.json()

print(json_data["results"])  # Access character data
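SWAPI returns numeric fields such as height and mass as strings, sometimes "unknown" and sometimes with thousands separators, so a typical cleaning pass normalizes them before storage. A minimal sketch (the to_int helper and field names height_cm / mass_kg are illustrative, not part of the API):

# Normalize SWAPI's string-typed numeric fields before storage.
def to_int(value):
    try:
        return int(value.replace(",", ""))
    except (ValueError, AttributeError):
        return None  # "unknown" and other non-numeric values

people = [
    {
        "name": person["name"],
        "height_cm": to_int(person["height"]),
        "mass_kg": to_int(person["mass"]),
    }
    for person in json_data["results"]
]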

Handling Non-English Characters

When JSON contains non-English text, such as Chinese characters, you should explicitly set the encoding:

response.encoding = "utf8"
json_data = response.json()

This step prevents garbled text and ensures accurate downstream processing.
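If you are unsure which charset the server actually used, requests can also guess one from the response body. A small sketch, assuming the server's declared charset may be missing or wrong:

# Fall back to requests' content-based guess when the declared
# charset is absent or the unreliable ISO-8859-1 default.
if not response.encoding or response.encoding.lower() == "iso-8859-1":
    response.encoding = response.apparent_encoding
json_data = response.json()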


4. Storing JSON Data in MongoDB (NoSQL)

When JSON structures become deeply nested, traditional SQL databases introduce unnecessary complexity. In contrast, MongoDB handles nested documents naturally, making it a strong choice for Python data cleaning for web scraping.

Installation

pip install pymongo

Insert JSON Data into MongoDB

import pymongo
from pymongo.errors import BulkWriteError

# Fill in your own connection details.
user, password, host, port = "user", "password", "localhost", 27017

client = pymongo.MongoClient(
    f"mongodb://{user}:{password}@{host}:{port}"
)
db = client["db_spider"]
collection = db["wars_star"]

# A unique index rejects duplicate names on re-runs
collection.create_index("name", unique=True)

# ordered=False keeps inserting past duplicates, but they still raise
# a BulkWriteError at the end; catch it so re-runs don't crash
try:
    collection.insert_many(json_data["results"], ordered=False)
except BulkWriteError:
    pass  # duplicates were skipped, new documents were inserted

Query Examples (MongoDB shell)

Find characters whose names contain “Le”:

db.getCollection("wars_star").find({ name: /Le/ })

Find characters appearing in a specific film:

db.getCollection("wars_star").find({
  films: { $in: ["https://swapi.dev/api/films/1/"] }
})
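The same lookups work from Python through pymongo's query operators. A sketch against the collection created above:

# Names containing "Le" (equivalent to the shell regex /Le/)
for doc in collection.find({"name": {"$regex": "Le"}}):
    print(doc["name"])

# Matching a scalar against the "films" array finds documents
# whose array contains that value
film_url = "https://swapi.dev/api/films/1/"
print(collection.count_documents({"films": film_url}))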

Because MongoDB supports flexible schemas, it simplifies storage and querying of API responses with variable fields.


5. Handling JavaScript Object Data (JSONP)

Some websites return data wrapped inside JavaScript objects rather than pure JSON. Financial websites often use this pattern.

Example: Parsing JavaScript Object Data

import demjson  # on Python 3.7+, the maintained fork is demjson3

# `response` is a requests response whose body looks like
# `var someVar = { ... };` rather than pure JSON.

# Slice out the object literal between the "=" and the trailing ";"
js_data = response.text[
    response.text.find("=") + 2 : response.text.rfind(";")
]

# demjson tolerates JavaScript syntax (unquoted keys, single quotes)
# that the standard json module rejects
raw_data = demjson.decode(js_data)

# here the payload's "datas" field holds comma-separated records
rank_list = [item.split(",") for item in raw_data["datas"]]

This approach allows you to convert JavaScript-style data into structured Python objects without browser automation.
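To see the whole round trip without hitting a live endpoint, here is a self-contained sketch with an invented JSONP-style payload (the variable name rankData and the field values are made up for illustration):

import demjson

# Invented payload mimicking a JS-wrapped response; real endpoints differ
jsonp_text = 'var rankData = {datas:["1,AAPL,182.50","2,MSFT,411.22"]};'

js_data = jsonp_text[jsonp_text.find("=") + 2 : jsonp_text.rfind(";")]
raw_data = demjson.decode(js_data)  # handles the unquoted `datas` key

rank_list = [item.split(",") for item in raw_data["datas"]]
print(rank_list)  # [['1', 'AAPL', '182.50'], ['2', 'MSFT', '411.22']]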


6. Regular Expressions: The Universal Tool

Even with structured APIs, some data only appears inside raw HTML or text. In such cases, regular expressions provide a reliable fallback.

Single Match with re.search

import re

html = '<div class="q-text">6,526 followers</div>'
match = re.search(r">(.*?) followers<", html)

followers = int(match.group(1).replace(",", ""))
print(followers)  # 6526

Multiple Matches with re.findall

text = "Phone numbers: 18767543212 and 19767443218"
phones = re.findall(r"\d{11}", text)

print(phones)
# ['18767543212', '19767443218']
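One caveat: \d{11} also matches 11-digit runs embedded inside longer numbers. Lookarounds make the match exact, as in this sketch:

# Reject matches embedded in longer digit runs
phones = re.findall(r"(?<!\d)\d{11}(?!\d)", text)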

Regular expressions remain indispensable when APIs are unavailable or page structures change frequently.
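When the same pattern runs across thousands of pages, compiling it once avoids repeated parsing and makes the missing-match case explicit. A sketch reusing the followers pattern from above (the FOLLOWERS_RE name and extract_followers helper are illustrative):

import re

# Compile once, reuse across pages
FOLLOWERS_RE = re.compile(r">([\d,]+) followers<")

def extract_followers(page_html):
    match = FOLLOWERS_RE.search(page_html)
    return int(match.group(1).replace(",", "")) if match else None

print(extract_followers('<div class="q-text">6,526 followers</div>'))  # 6526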


When to Use Each Technique

Scenario                       Recommended Method
API responses                  JSON parsing
Nested or flexible schemas     MongoDB
JavaScript-returned objects    JSONP + demjson
Unstructured HTML/text         Regular expressions

In practice, effective Python data cleaning for web scraping combines multiple techniques rather than relying on a single solution.


Conclusion

In this chapter, you learned how to clean and process scraped data using JSON parsing, MongoDB storage, JavaScript object handling, and regular expressions. These techniques cover the majority of real-world scraping scenarios and integrate smoothly with larger crawling pipelines.

For more on extracting raw HTML data before cleaning, see:

Crawling HTML Pages: Python Web Scraping Tutorial

https://www.2808proxy.com/practical-application-of-crawler

In the next installment, we will explore more advanced data cleaning strategies that further improve crawler efficiency and data quality.